
[SPARK-18206][ML] Add instrumentation for MLP, NB, LDA, AFT, GLM, Isotonic, LiR #15671

Closed
wants to merge 21 commits into master from zhengruifeng:lir_instr

Conversation

@zhengruifeng (Contributor) commented Oct 28, 2016

What changes were proposed in this pull request?

Add instrumentation for MLP, NB, LDA, AFT, GLM, Isotonic, and LiR.

How was this patch tested?

Local tests in spark-shell.
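For anyone reproducing this locally, a minimal check in spark-shell could look like the sketch below (the data and column names are only illustrative; the exact log text depends on the Instrumentation prefix and on having INFO logging enabled):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression

val df = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")

// With INFO logging on, fit() should print the instrumentation output:
// the logged params followed by a training-finished message.
val model = new LinearRegression().setMaxIter(5).fit(df)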

@SparkQA commented Oct 28, 2016

Test build #67700 has finished for PR 15671 at commit 181e1b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 28, 2016

Test build #67703 has finished for PR 15671 at commit ad707d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 28, 2016

Test build #67704 has finished for PR 15671 at commit 6fabd70.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 28, 2016

Test build #67706 has finished for PR 15671 at commit 7c74241.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -96,11 +96,13 @@ class DecisionTreeClassifier @Since("1.4.0") (

val instr = Instrumentation.create(this, oldDataset)
instr.logParams(params: _*)
instr.logNumClasses(numClasses)
Contributor

These are already logged inside of RandomForest.run

@zhengruifeng (Contributor, Author)

@sethah Thanks. I have reverted those changes in the tree algorithms.

@SparkQA commented Oct 29, 2016

Test build #67741 has finished for PR 15671 at commit 05e2f4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author)

@jkbradley @yanboliang Could you please review?

@sethah (Contributor) commented Oct 31, 2016

Minor: would you mind changing the title to "Add instrumentation logs to ML training algorithms"? I initially thought this PR was adding them to the old MLlib API, so the new title is a bit clearer.

@sethah (Contributor) left a comment

Left mostly minor comments. Let's create a JIRA for other meta-algorithms like CrossValidator, TrainValidationSplit, and OneVsRest. Thanks for doing this!

@@ -234,8 +234,14 @@ class MultilayerPerceptronClassifier @Since("1.5.0") (
* @return Fitted model
*/
override protected def train(dataset: Dataset[_]): MultilayerPerceptronClassificationModel = {
val instr = Instrumentation.create(this, dataset)
instr.logParams(params : _*)
Contributor

In general, we are trying to log all params that are useful, but not params that could overload the logs. Since MLP has an initialWeights parameter that could potentially be very large, we should not include it here.
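For example, explicit selection in MultilayerPerceptronClassifier.train could look roughly like this (a sketch only; the exact param list is up for discussion):

val instr = Instrumentation.create(this, dataset)
// Log only small, human-readable params; deliberately omit initialWeights,
// which can hold a very large vector.
instr.logParams(labelCol, featuresCol, predictionCol, layers, maxIter,
  tol, blockSize, solver, stepSize, seed)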

@zhengruifeng (Contributor, Author)

OK, I will update it here and in the other algorithms that support an initialModel.

@@ -216,7 +221,10 @@ class NaiveBayes @Since("1.5.0") (

val pi = Vectors.dense(piArray)
val theta = new DenseMatrix(numLabels, numFeatures, thetaArray, true)
new NaiveBayesModel(uid, pi, theta).setOldLabels(labelArray)
val model = new NaiveBayesModel(uid, pi, theta).setOldLabels(labelArray)

Contributor

nit: remove empty line

@@ -896,6 +896,10 @@ class LDA @Since("1.6.0") (
@Since("2.0.0")
override def fit(dataset: Dataset[_]): LDAModel = {
transformSchema(dataset.schema, logging = true)

val instr = Instrumentation.create(this, dataset)
instr.logParams(params : _*)
Contributor

Likewise here we probably should not log docConcentration.

copyValues(newModel).setParent(this)

val model = copyValues(newModel).setParent(this)
instr.logSuccess(newModel)
Contributor

minor: instr.logSuccess(model)

@@ -226,6 +226,10 @@ class AFTSurvivalRegression @Since("1.6.0") (@Since("1.6.0") override val uid: S
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
val numFeatures = featuresStd.size

val instr = Instrumentation.create(this, dataset)
instr.logParams(params : _*)
Contributor

Maybe we should not log quantileProbabilities? I am not familiar with this algorithm, so I'm not sure whether these can ever be too large.

@@ -276,6 +280,7 @@ class AFTSurvivalRegression @Since("1.6.0") (@Since("1.6.0") override val uid: S
val intercept = parameters(1)
val scale = math.exp(parameters(0))
val model = new AFTSurvivalRegressionModel(uid, coefficients, intercept, scale)
instr.logSuccess(model)
Contributor

minor: move the logSuccess call after the copyValues. That way, if we ever do something in logSuccess, like logging parameters of the model, the call will reflect the copied parameters.
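That is, the suggested ordering is roughly (a sketch, following the variable names in the diff above):

// Copy estimator params onto the model first, then log success,
// so logSuccess sees the fully-populated model.
val model = copyValues(
  new AFTSurvivalRegressionModel(uid, coefficients, intercept, scale).setParent(this))
instr.logSuccess(model)
model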

Contributor

Also, would you mind changing the doc in Instrumentation.logSuccess from:

"Logs the successful completion of the training session and the value of the learned model."

to

"Logs the successful completion of the training session."

Since we currently don't do anything other than log a string inside logSuccess.

@@ -284,6 +288,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
.setParent(this))
val trainingSummary = new GeneralizedLinearRegressionTrainingSummary(dataset, model,
irlsModel.diagInvAtWA.toArray, irlsModel.numIterations, getSolver)
instr.logSuccess(model)
model.setSummary(trainingSummary)
Contributor

minor: move logSuccess after setSummary

@@ -284,6 +288,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
.setParent(this))
val trainingSummary = new GeneralizedLinearRegressionTrainingSummary(dataset, model,
irlsModel.diagInvAtWA.toArray, irlsModel.numIterations, getSolver)
instr.logSuccess(model)
Contributor

We need to log success if we enter the WeightedLeastSquares branch above as well. Maybe we can refactor the code to use if else instead of an if with a return. That way we only need to log success in one place.

@@ -392,6 +399,8 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
model,
Array(0D),
objectiveHistory)

instr.logSuccess(model)
Contributor

minor: move logSuccess after setSummary, here and elsewhere.

@zhengruifeng changed the title from "[SPARK-14567][ML]Add instrumentation logs to MLlib training algorithms" to "[SPARK-14567][ML]Add instrumentation logs to ML training algorithms" on Nov 1, 2016
@zhengruifeng (Contributor, Author)

@sethah Thanks for your review. I have made changes according to your comments, and I will create JIRAs for the meta-algorithms.

@SparkQA commented Nov 1, 2016

Test build #67878 has finished for PR 15671 at commit df8734e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member)

I just made a subtask for the algs covered here. Can you please link this PR to SPARK-18206 in the title instead of the umbrella task?

@sethah (Contributor) commented Nov 1, 2016

So, if we consistently use instr.logParams(params: _*) (even in cases where it's acceptable) then we run the risk of adding some param in the future that could "overload" the logs (like initialModel). However, if we manually select the appropriate params to log, then we risk adding some other param in the future which we do want to log, but it never gets added. Both could be problematic.

For now, I think I lean towards manually selecting which params to log rather than logging all params. If we add more params later we will have to remember to add them to the logging. What are others' thoughts?

Alternatively, we could use a filter on certain types of params - like Array[_]. Or we could pass in specific params to exclude in the function call.
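For what it's worth, the type-filter idea could look something like this (a rough sketch; logFilteredParams is a hypothetical helper, and it assumes the caller lives inside the spark namespace since Instrumentation is private[spark]):

import org.apache.spark.ml.param.{Param, Params}
import org.apache.spark.ml.util.Instrumentation

// Hypothetical helper: log every explicitly-set param except array-valued ones
// (e.g. initialWeights, docConcentration), which could overload the logs.
def logFilteredParams(instr: Instrumentation[_], estimator: Params): Unit = {
  estimator.params.foreach { p =>
    estimator.get(p) match {
      case Some(_: Array[_]) => // skip potentially huge array-valued params
      case Some(_) => instr.logParams(p)
      case None => // param not set; nothing to log
    }
  }
}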

@jkbradley (Member)

Good point. I agree with you about logging selected Params only.

@zhengruifeng (Contributor, Author)

@sethah @jkbradley OK, I will link this PR to the subtask and change the algorithms that use instr.logParams(params: _*) to explicitly select some params.

@zhengruifeng changed the title from "[SPARK-14567][ML]Add instrumentation logs to ML training algorithms" to "[SPARK-18206][ML]Add instrumentation logs to ML training algorithms" on Nov 2, 2016
@SparkQA commented Nov 2, 2016

Test build #67947 has finished for PR 15671 at commit 6d2d13f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Nov 2, 2016

IMO, the way we're doing this logging right now is unsustainable. It requires too much manual work. We can leave this discussion for a different JIRA, but what we could do is modify the Instrumentation class to just truncate the param value string after a certain number of characters. Then, we could even modify Predictor.fit to create Instrumentation and log all params. The only thing we'd have to do in the individual algorithms is log anything else - algorithm specific - that we want to add. I haven't tested this yet. @jkbradley @zhengruifeng what do you think?
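A minimal sketch of the truncation part (a hypothetical helper; see the JSON concern raised further down before adopting it):

// Hypothetical: cap the rendered param value at maxLen characters before logging.
def truncatedValue(value: Any, maxLen: Int = 256): String = {
  val s = value.toString
  if (s.length <= maxLen) s else s.take(maxLen) + "... (truncated)"
}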

@sethah (Contributor) left a comment

A few minor comments, otherwise LGTM. Still, we definitely need to find a way to make this more sustainable.

return model.setSummary(trainingSummary)
model.setSummary(trainingSummary)
instr.logSuccess(model)
return model
Contributor

would you mind rewriting this if statement to:

val model = if (familyObj == Gaussian && linkObj == Identity) {
  ...
} else {
  ...
}
instr.logSuccess(model)
model

Then we can avoid the code duplication?

@zhengruifeng (Contributor, Author)

OK, I will change this.

@@ -227,7 +227,8 @@ class AFTSurvivalRegression @Since("1.6.0") (@Since("1.6.0") override val uid: S
val numFeatures = featuresStd.size

val instr = Instrumentation.create(this, dataset)
instr.logParams(params : _*)
instr.logParams(labelCol, featuresCol, censorCol, predictionCol, quantilesCol,
Contributor

add aggregationDepth

@zhengruifeng (Contributor, Author)

OK, thanks.

@@ -196,7 +196,8 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
}

val instr = Instrumentation.create(this, dataset)
instr.logParams(params : _*)
instr.logParams(labelCol, featuresCol, weightCol, predictionCol, solver, tol,
Contributor

add aggregationDepth

@zhengruifeng (Contributor, Author)

@sethah I have made changes according to the comments. Thanks for your review.

@zhengruifeng (Contributor, Author)

@sethah I agree that manually listing the params to log is error-prone. I think we can log all params except those labeled do-not-log in the individual algorithms. Or we could add a new method, def logParams(params: Seq[Param], except: Seq[Param]).
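A sketch of that second option, as an overload on Instrumentation (the body is illustrative; it just delegates to the existing varargs logParams):

// Hypothetical overload: log all of `params` except those listed in `except`.
def logParams(params: Seq[Param[_]], except: Seq[Param[_]]): Unit = {
  val excluded = except.map(_.name).toSet
  logParams(params.filterNot(p => excluded.contains(p.name)): _*)
}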

@SparkQA commented Nov 3, 2016

Test build #68039 has finished for PR 15671 at commit e77bdc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Nov 3, 2016

I created SPARK-18253 to track it. We may have to get to it after the 2.1 QA period.

@jkbradley (Member)

I don't want to truncate Param strings because it would create invalid JSON in case people want to try to catch and parse the logs. I like the idea of allowing exceptions and possibly adding unit tests to ensure the logs do not blow up.
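To illustrate the concern: a JSON-encoded array param that gets cut off by a length cap is no longer parseable at all (illustrative value):

// A JSON-encoded array param value, cut off mid-array by a length cap:
val truncated = """{"quantileProbabilities":[0.01,0.05,0.1,0.25,0."""
// Any JSON parser (e.g. org.json4s.jackson.JsonMethods.parse) rejects this,
// so downstream tooling that expects one JSON document per log line would break.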

@zhengruifeng (Contributor, Author)

@jkbradley Updated according to your comments, including adding quantileProbabilities and docConcentration.

@SparkQA commented Jan 5, 2017

Test build #70901 has finished for PR 15671 at commit c8693d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 9, 2017

Test build #71064 has finished for PR 15671 at commit e6b4615.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author)

ping @jkbradley

@jkbradley (Member)

Thanks for the updates!

For docConcentration and quantileProbabilities, I agree it could be problematic if these are too large. How about:

  • We don't log docConcentration since that could easily be large and since it is less likely to cause errors which users will interpret as bugs.
  • We log quantileProbabilities.size since a large array could cause problems.
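Concretely, that could look something like the following in AFTSurvivalRegression.train (a sketch; it assumes Instrumentation exposes a logNamedValue(name, value) helper, otherwise an equivalent log line would do):

val instr = Instrumentation.create(this, dataset)
// Log the scalar AFT params, but only the *size* of quantileProbabilities,
// since the array itself can be arbitrarily large.
instr.logParams(labelCol, featuresCol, censorCol, predictionCol, quantilesCol,
  fitIntercept, maxIter, tol, aggregationDepth)
instr.logNamedValue("quantileProbabilities.size", $(quantileProbabilities).length.toString)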

@SparkQA commented Jan 13, 2017

Test build #71285 has finished for PR 15671 at commit c8188b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author)

@jkbradley Updated. Thanks for reviewing!

@zhengruifeng (Contributor, Author)

re-ping @jkbradley

@jkbradley (Member)

LGTM pending one more run of the tests. Thanks a lot, @zhengruifeng!

Jenkins test this please

@SparkQA commented Jan 17, 2017

Test build #3537 has finished for PR 15671 at commit c8188b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member)

Merging with master
Thanks!

@asfgit closed this in e7f982b on Jan 17, 2017
@zhengruifeng deleted the lir_instr branch on January 18, 2017, 01:42
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017